1- Problem:

Stroke is the second leading cause of death and the most common cause of disability worldwide. The WHO estimates that 1 in 4 people may experience a stroke during their lifetime. Because strokes can occur at any time and affect anyone regardless of age, we chose to concentrate on this dataset. Given the sudden nature of strokes, we intend to investigate and analyze the data to identify risk factors and shed light on the types of people who are likely to experience one, enabling earlier intervention. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, BMI, pre-existing conditions such as hypertension and heart disease, smoking status, marital status, and residence type. Each row in the data provides relevant information about one patient.

2- Data mining task:

Data mining plays a crucial role in predicting the probability of having a stroke through classification and clustering techniques. By applying data mining algorithms to a large dataset containing various health-related features, valuable patterns and relationships can be discovered. In the classification aspect, data mining aids in building models that can accurately classify individuals into different categories, such as 1 for “stroke” or 0 for “non-stroke,” based on their attributes and risk factors. This helps in identifying individuals who are more likely to experience a stroke, enabling proactive interventions and preventive measures. On the other hand, clustering techniques assist in identifying groups or clusters of individuals with similar characteristics, allowing for a deeper understanding of stroke risk factors and potential subgroups within the population. By leveraging data mining in stroke prediction, healthcare professionals and researchers can gain valuable insights and develop effective strategies for stroke prevention, early detection, and personalized treatments.

3- Dataset information:

Our dataset source is : https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

data<-read.csv("Dataset/healthcare-dataset-stroke-data.csv")
head(data)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

General information about the dataset:

Our dataset sample contains 5,110 objects described by 12 attributes. The attribute values determine their types: for example, nominal for id, binary for gender, and numeric for age.
The hypertension and heart_disease attributes take the values 1 and 0 to indicate whether or not the patient suffers from the condition. The last attribute, "stroke", takes the values 1 and 0 to indicate whether the patient had a stroke; this is the label we aim to train our model to predict.

Data dictionary:

Attribute Name | Description | Data Type | Possible values
id | Unique id of the patient | Nominal | 67 to 72940
gender | Gender of the patient | Binary | Female, Male
age | Age of the patient | Numeric | 0.08 to 82
hypertension | 1 means the patient has hypertension, 0 means they do not | Binary | 0, 1
heart_disease | 1 means the patient has heart disease, 0 means they do not | Binary | 0, 1
ever_married | Has the patient ever been married? | Binary | Yes, No
work_type | Work type of the patient | Nominal | "Private", "Self-employed", "children", "Govt_job", "Never_worked"
Residence_type | Residence type of the patient | Binary | "Urban", "Rural"
avg_glucose_level | Average glucose level in blood | Numeric | 55.1 to 272
bmi | Body Mass Index | Numeric | 10.3 to 97.6
smoking_status | Smoking status of the patient | Nominal | "never smoked", "Unknown", "formerly smoked", "smokes"
stroke | Stroke event: 1 means the patient had a stroke, 0 means not | Binary | 0, 1
str(data)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : chr  "36.6" "N/A" "32.5" "34.4" ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...
#Number of rows
nrow(data)
## [1] 5110
#Number of columns
ncol(data)
## [1] 12
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(data)
## data 
## 
##  12  Variables      5110  Observations
## --------------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     5110        1    36518    24436     3590     6972 
##      .25      .50      .75      .90      .95 
##    17741    36932    54682    65668    69218 
## 
## lowest :    67    77    84    91    99, highest: 72911 72914 72915 72918 72940
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     5110        0        3 
##                                
## Value      Female   Male  Other
## Frequency    2994   2115      1
## Proportion  0.586  0.414  0.000
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0      104        1    43.23    26.03        5       11 
##      .25      .50      .75      .90      .95 
##       25       45       61       75       79 
## 
## lowest : 0.08 0.16 0.24 0.32 0.4 , highest: 78   79   80   81   82  
## --------------------------------------------------------------------------------
## hypertension 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     5110        0        2    0.264      498  0.09746    0.176 
## 
## --------------------------------------------------------------------------------
## heart_disease 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     5110        0        2    0.153      276  0.05401   0.1022 
## 
## --------------------------------------------------------------------------------
## ever_married 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   1757  3353
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## work_type 
##        n  missing distinct 
##     5110        0        5 
##                                                                   
## Value           children      Govt_job  Never_worked       Private
## Frequency            687           657            22          2925
## Proportion         0.134         0.129         0.004         0.572
##                         
## Value      Self-employed
## Frequency            819
## Proportion         0.160
## --------------------------------------------------------------------------------
## Residence_type 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value      Rural Urban
## Frequency   2514  2596
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## avg_glucose_level 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     3979        1    106.1    45.38    60.71    65.79 
##      .25      .50      .75      .90      .95 
##    77.24    91.88   114.09   192.18   216.29 
## 
## lowest : 55.12  55.22  55.23  55.25  55.26 , highest: 266.59 267.6  267.61 267.76 271.74
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct 
##     5110        0      419 
## 
## lowest : 10.3 11.3 11.5 12   12.3, highest: 71.9 78   92   97.6 N/A 
## --------------------------------------------------------------------------------
## smoking_status 
##        n  missing distinct 
##     5110        0        4 
##                                                                           
## Value      formerly smoked    never smoked          smokes         Unknown
## Frequency              885            1892             789            1544
## Proportion           0.173           0.370           0.154           0.302
## --------------------------------------------------------------------------------
## stroke 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     5110        0        2    0.139      249  0.04873  0.09273 
## 
## --------------------------------------------------------------------------------
#Five number summary:
summary(data)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level     bmi            smoking_status         stroke       
##  Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :106.15                                          Mean   :0.04873  
##  3rd Qu.:114.09                                          3rd Qu.:0.00000  
##  Max.   :271.74                                          Max.   :1.00000
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data %>% summarise_if(is.numeric, var)
##          id      age hypertension heart_disease avg_glucose_level     stroke
## 1 447818462 511.3318   0.08797552    0.05110447          2050.601 0.04636264

The summary above describes the variables in the dataset: id, gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, body mass index (BMI), smoking status, and stroke occurrence.

Analyzing the summary, we can observe that the dataset consists of 5,110 observations. Regarding gender distribution, the majority of individuals are female (2,994), followed by males (2,115), with only one observation classified as "Other". Ages span from a minimum of 0.08 to a maximum of 82 years, with a mean age of 43.23. Approximately 9.75% of the individuals have hypertension, and around 5.40% have heart disease.

Examining marital status, the majority of individuals are or have been married (3,353), while a smaller proportion have not (1,757). For work type, the most common category is "Private" (2,925), followed by "Self-employed" (819), "children" (687), and "Govt_job" (657), with "Never_worked" (22) the rarest. Residence type is roughly evenly split between rural (2,514) and urban (2,596) areas.

The average glucose level ranges from a minimum of 55.12 to a maximum of 271.74, with a mean of 106.15. The bmi column is stored as character because 201 observations contain the string "N/A"; the recorded values range from 10.3 to 97.6. Regarding smoking status, the most common category is "never smoked" (1,892), followed by "Unknown" (1,544), "formerly smoked" (885), and "smokes" (789). Finally, the stroke variable indicates that 4.87% of the individuals in the dataset experienced a stroke.

In summary, the dataset covers a diverse population in terms of gender, age, and various health-related factors. It includes information on hypertension, heart disease, and stroke occurrences, as well as lifestyle details such as marital status, work type, residence type, and smoking status. The dataset can be used for further analysis and exploration of factors associated with stroke occurrence and potential risk factors within the given population.
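One detail worth flagging from the str() output: bmi is stored as character because missing entries are the literal string "N/A", which base R's is.na() does not detect. A minimal illustration on a toy vector:

```r
# "N/A" is a character string, not a real NA, so is.na() misses it
x <- c("36.6", "N/A", "32.5")
sum(is.na(x))    # 0 -- no true NAs yet
sum(x == "N/A")  # 1 -- the masked missing value

# replacing the sentinel first yields a proper NA after conversion
x[x == "N/A"] <- NA
sum(is.na(as.numeric(x)))  # 1
```

This is why the preprocessing section below converts the "N/A" strings before casting bmi to numeric.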

# import necessary libraries
library(tidyverse)   
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(outliers)
library(caret) 
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(DMwR2) 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

4- Understanding the data through graph representations:

# check gender
data <- data %>% filter(gender != "Other")
ggplot(data, aes(x = gender, fill = as.factor(stroke))) +
  geom_bar(position = "fill") +
  labs(fill = "STROKE", title = "Incidence of stroke by gender")

The bar chart shows the proportion of stroke cases within each gender in our dataset, indicating whether there is a correlation between gender and stroke frequency. We can see that men are slightly more likely than women to experience a stroke.


tab <- data$work_type %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100 
txt <- paste0(names(tab), '\n', percentages, '%') # text on chart
pie(tab, labels=txt, main= "Chart of employment status") # plot pie chart

This pie chart illustrates the distribution of worker types by employment sector in our dataset. Summarizing this nominal attribute shows that people who work in the private sector are present at a higher percentage (57.2%) than the self-employed (16%), and so on.


ggplot(data) + geom_point(mapping = aes(y = age, x = stroke, color = stroke), alpha = 0.9) + 
  labs(title = "Distribution of Stroke status by age") +
  theme(plot.title = element_text(size = 14, face = "bold"), legend.position = "none",
        axis.line = element_line(linewidth = 1), axis.ticks = element_line())

The chart above shows the age distribution of stroke cases. Our results show a correlation between age and stroke: the likelihood of stroke increases with age.


ggplot(data, aes(x = ever_married, fill = as.factor(stroke))) +
  geom_bar(position = "fill") +
  labs(fill = "STROKE", title = "Correlation between stroke occurrence and marriage status")

The bar graph illustrates the relationship between stroke occurrence and marital status. We found that people who are or have been married show a higher stroke rate than people who have never married.


# Histogram of Average Glucose Level with normal distribution overlay
histglucose <- hist(data$avg_glucose_level,xlim=c(0,300),
                main="Histogram of Avg. Glucose with Normal Distribution Overlay",
                xlab="Avg. Glucose",las=1)
xfit <- seq(min(data$avg_glucose_level),max(data$avg_glucose_level))
yfit <- dnorm(xfit,mean=mean(data$avg_glucose_level),sd=sd(data$avg_glucose_level))
yfit <- yfit*diff(histglucose$mids[1:2])*length(data$avg_glucose_level)
lines(xfit,yfit,col="red",lwd=2)

The average glucose levels of the patients in the study are right-skewed, with a mean of 106.15 from the summary() output earlier. We notice an increase in frequency when the glucose level approaches 200, which raises the question of whether this elevation is a risk factor for stroke.

5- Data Preprocessing:

In our data, we performed several preprocessing techniques such as data cleaning, data normalization, and removal of outliers and missing values. We go through each one in detail below.

5.1 Handle missing values:

Missing values can cause issues when performing data analysis or building machine learning models, as they can lead to inaccurate results or errors. The bmi attribute contains 201 entries recorded as the string "N/A"; we convert these to proper NA values and then impute them with the mean of the observed values.

# Change "N/A" to actual NULL
data$bmi[data$bmi=="N/A"] <-NA
# Checking missing values
sum(is.na(data))
## [1] 201
# Converting bmi to numeric
data$bmi <- as.numeric(as.character(data$bmi))
# Checking bmi type
class(data$bmi)
## [1] "numeric"
# Replacing null values with the mean
data$bmi[is.na(data$bmi)]<-mean(data$bmi, na.rm = TRUE)

# Missing values
sum(is.na(data))
## [1] 0
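The mean-imputation pattern used above can be illustrated on a small toy vector (the values here are illustrative, not from the dataset):

```r
# replace NAs with the mean of the observed values
v <- c(30.1, NA, 25.4, NA, 35.5)
v[is.na(v)] <- mean(v, na.rm = TRUE)
v
# the imputed entries equal the mean of 30.1, 25.4 and 35.5 (approx. 30.33)
```

Mean imputation preserves the column mean but shrinks its variance, which is acceptable here given the small share of missing bmi values (201 of 5,110).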

5.2 Check duplicated rows

We check for duplicated rows; removing them decreases dataset volume and enhances data quality, which leads to better model performance and speed.

# Checking duplicated rows
sum(duplicated(data))
## [1] 0

5.3 Detect and remove outliers:

We checked for outlier values and detected a few in our dataset. Outliers, being data points that deviate significantly from the majority of the data, can distort statistical analyses and modeling and affect the validity and reliability of conclusions drawn from the analysis. Therefore, we delete them before we start our work in order to prevent them from affecting our results. For detection we used the outlier() function from the outliers package, which reports the single observation farthest from the mean of an attribute.
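To make the behavior of outlier() concrete, the same check can be sketched in base R: it returns the single value with the largest absolute distance from the mean (the function name below is our own, for illustration):

```r
# base-R sketch of what outliers::outlier() reports:
# the single observation farthest from the mean
most_extreme <- function(x) x[which.max(abs(x - mean(x)))]

most_extreme(c(1, 2, 3, 4, 100))   # 100
most_extreme(c(0.08, 45, 61, 80))  # 0.08 -- mirrors the age result below
```

Because only one extreme value is reported per call, each attribute is handled in a separate detect-then-remove step below.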

#call outliers library
library(outliers)
#detect Age outliers
OutAge <- outlier(data$age)
print(OutAge)
## [1] 0.08

After detecting the outlier value of the age attribute, we verified its locations before deleting the matching rows.

#checking the outlier location before delete
indices <- which(data$age == OutAge)
# Print the resulting row indices
print(indices)
## [1] 1615 3295
#Remove age outlier
data <- data[data$age != OutAge, ]
#detect Average glucose level outliers
OutAvg <- outlier(data$avg_glucose_level)
print(OutAvg)
## [1] 271.74
#Remove Average glucose level outlier
data <- data[data$avg_glucose_level != OutAvg, ]
#detect bmi outliers
OutBMI <- outlier(data$bmi)
print(OutBMI)
## [1] 97.6
#Remove bmi outlier
data <- data[data$bmi != OutBMI, ]
#check after deleting
#Number of rows
nrow(data)
## [1] 5105
#Number of columns
ncol(data)
## [1] 12

We noticed that the row count dropped from 5,110 to 5,105, which means the outlier rows were deleted successfully (two rows with the age outlier 0.08, one glucose outlier, one BMI outlier, plus the single "Other" gender row removed earlier). Since outliers cause noise in the data, removing them will help us obtain more accurate results.

To make sure that the deletion was successful, we searched for the rows that contain the outlier values; all results were empty (integer(0)), confirming that the deletion was successful.

indices <- which(data$age == OutAge)
# Print the resulting row indices
print(indices)
## integer(0)
indices3 <- which(data$avg_glucose_level == OutAvg)
# Print the resulting row indices
print(indices3)
## integer(0)
indices2 <- which(data$bmi == 97.6)
# Print the resulting row indices
print(indices2)
## integer(0)

5.4 Cleaning the data

# only one row had gender "Other"; it was already filtered out before plotting
data[data$gender=="Other", ]
##  [1] id                gender            age               hypertension     
##  [5] heart_disease     ever_married      work_type         Residence_type   
##  [9] avg_glucose_level bmi               smoking_status    stroke           
## <0 rows> (or 0-length row.names)
# delete it 
data = data[data$gender!="Other", ]
##check
table(data$gender)
## 
## Female   Male 
##   2993   2112
## convert stroke to a factor (required for classification and upsampling later)
data$stroke <- as.factor(data$stroke)

5.5 Encoding categorical data:

Encoding is an important step in data mining and machine learning tasks because it transforms raw data into a format that algorithms can process and analyze effectively: categorical or textual values are converted into numerical representations.

head(data)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level      bmi  smoking_status stroke
## 1          Urban            228.69 36.60000 formerly smoked      1
## 2          Rural            202.21 28.89456    never smoked      1
## 3          Rural            105.92 32.50000    never smoked      1
## 4          Urban            171.23 34.40000          smokes      1
## 5          Rural            174.12 24.00000    never smoked      1
## 6          Urban            186.21 29.00000 formerly smoked      1
data$work_type = factor(data$work_type,levels = c("Govt_job","Private", "Self-employed","children","Never_worked"), labels = c(5,4,3,2,1))
data$gender = factor(data$gender, levels = c("Male", "Female"), labels = c(1, 2))
data$ever_married= factor(data$ever_married, levels = c("No", "Yes"), labels = c(0, 1))
data$Residence_type= factor(data$Residence_type, levels = c("Urban", "Rural"), labels=c(1,2))
data$smoking_status= factor(data$smoking_status, levels = c("Unknown","never smoked", "formerly smoked","smokes"), labels=c(1,2,3,4))
head(data)
##      id gender age hypertension heart_disease ever_married work_type
## 1  9046      1  67            0             1            1         4
## 2 51676      2  61            0             0            1         3
## 3 31112      1  80            0             1            1         4
## 4 60182      2  49            0             0            1         4
## 5  1665      2  79            1             0            1         3
## 6 56669      1  81            0             0            1         4
##   Residence_type avg_glucose_level      bmi smoking_status stroke
## 1              1            228.69 36.60000              3      1
## 2              2            202.21 28.89456              2      1
## 3              2            105.92 32.50000              2      1
## 4              1            171.23 34.40000              4      1
## 5              2            174.12 24.00000              2      1
## 6              1            186.21 29.00000              3      1

6- Normalize Data using Min-Max Scaling:

Normalization was performed to ensure consistent scaling of the data. We applied min-max normalization, which rescales the values of selected attributes into the range 0 to 1. The attributes normalized were age, average glucose level, and BMI (Body Mass Index). The normalized dataset provides a more uniform and comparable representation of these attributes, enabling accurate analysis and modeling for stroke prediction, with the result shown below.

#normalize data
normalize <- function(x){ return ((x - min(x))/ (max(x)- min(x)))}
data$avg_glucose_level= normalize(data$avg_glucose_level)
data$age= normalize(data$age)
data$bmi= normalize(data$bmi)
head(data)
##      id gender       age hypertension heart_disease ever_married work_type
## 1  9046      1 0.8167155            0             1            1         4
## 2 51676      2 0.7434018            0             0            1         3
## 3 31112      1 0.9755621            0             1            1         4
## 4 60182      2 0.5967742            0             0            1         4
## 5  1665      2 0.9633431            1             0            1         3
## 6 56669      1 0.9877810            0             0            1         4
##   Residence_type avg_glucose_level       bmi smoking_status stroke
## 1              1         0.8162622 0.3219094              3      1
## 2              2         0.6917325 0.2275956              2      1
## 3              2         0.2389014 0.2717258              2      1
## 4              1         0.5460403 0.2949816              4      1
## 5              2         0.5596313 0.1676867              2      1
## 6              1         0.6164880 0.2288862              3      1
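A quick sanity check of the normalize() function defined above on a toy vector: the smallest value maps to 0, the largest to 1, and everything else falls in between.

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(50, 100, 150))  # 0.0 0.5 1.0
# caveats: a constant column would divide by zero, and the mapping is
# sensitive to extreme values, which is why outliers were removed first
```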

7- Discretization

After normalizing and cleaning the data, we found no need for feature discretization.

8- Feature selection

We reduce the number of input variables for our predictive model using Recursive Feature Elimination (RFE), a widely used algorithm for selecting the features most relevant to predicting the target variable (stroke in our case), together with the varImp() function, which estimates variable importance for fitted models.

# ensure results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)

# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(stroke~., data, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## ROC curve variable importance
## 
##                   Importance
## age                   0.8343
## ever_married          0.6190
## avg_glucose_level     0.6091
## hypertension          0.5867
## smoking_status        0.5759
## bmi                   0.5732
## heart_disease         0.5692
## work_type             0.5332
## Residence_type        0.5189
## gender                0.5093
## id                    0.5071
# plot importance
plot(importance)

# ensure the results are repeatable
set.seed(7)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(data[,1:11], data[,12], sizes=c(1:11), rfeControl=control)
# summarize the results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy      Kappa AccuracySD  KappaSD Selected
##          1   0.9514  0.0000000  0.0008027 0.000000        *
##          2   0.9508  0.0059106  0.0016974 0.022718         
##          3   0.9514  0.0000000  0.0008027 0.000000         
##          4   0.9510  0.0059750  0.0009257 0.021610         
##          5   0.9508 -0.0011340  0.0014248 0.001826         
##          6   0.9510  0.0062336  0.0018376 0.023737         
##          7   0.9510 -0.0007563  0.0012915 0.001594         
##          8   0.9512 -0.0003779  0.0006051 0.001195         
##          9   0.9506 -0.0014857  0.0015179 0.002577         
##         10   0.9510 -0.0007557  0.0009019 0.001593         
##         11   0.9510 -0.0007557  0.0009019 0.001593         
## 
## The top 1 variables (out of 1):
##    age
# list the chosen features
predictors(results)
## [1] "age"
# plot the results
plot(results, type=c("g", "o"))

The variable-importance ranking places age, ever_married, avg_glucose_level, and hypertension at the top, while RFE selects age alone as the best subset. Based on this, we keep age, hypertension, ever_married, and avg_glucose_level and remove the remaining low-importance attributes (id, Residence_type, gender, heart_disease, bmi, work_type, smoking_status).

#delete the unimportant feature columns
data <-data[,!names(data) %in% c("id","Residence_type","gender","heart_disease","bmi","work_type","smoking_status")]
head(data)
##         age hypertension ever_married avg_glucose_level stroke
## 1 0.8167155            0            1         0.8162622      1
## 2 0.7434018            0            1         0.6917325      1
## 3 0.9755621            0            1         0.2389014      1
## 4 0.5967742            0            1         0.5460403      1
## 5 0.9633431            1            1         0.5596313      1
## 6 0.9877810            0            1         0.6164880      1

Imbalanced dataset problem:

In the given dataset, it is observed that only around 5% of all the individuals have experienced a stroke at some point. Consequently, our baseline dummy model achieves an accuracy of 95% by consistently predicting that individuals do not have a stroke. When evaluating our model, an essential metric to consider is sensitivity, also known as recall or the probability of detection. Low sensitivity indicates that our model struggles to identify true positive cases, even if the overall accuracy is high. In this case, the dummy model has a sensitivity of 0 since it fails to identify any true positives.
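Sensitivity can be computed directly from a confusion matrix. A sketch for the all-negative dummy classifier described above, using illustrative counts (95 non-stroke, 5 stroke, mirroring the roughly 95/5 class split):

```r
# hypothetical counts: 95 non-stroke, 5 stroke; the dummy predicts all 0
actual    <- factor(c(rep(0, 95), rep(1, 5)), levels = c(0, 1))
predicted <- factor(rep(0, 100), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)
accuracy    <- sum(diag(cm)) / sum(cm)        # 0.95 despite learning nothing
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 0 -- no true positives found
```

This is why accuracy alone is misleading on imbalanced data and sensitivity must be tracked as well.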

There are various approaches to addressing the class imbalance issue. Given the limited size of the dataset, oversampling the minority class is the most suitable strategy: by duplicating minority-class instances, we increase their representation, enabling the model to learn and generalize better for this class.

# upscaling the data
library('caret')
library('dplyr')
data<-upSample(data[,-5],data$stroke, yname="stroke")
plot(data$stroke)

data$stroke<- as.factor(data$stroke)
#Finally, checking overall
head(data)
##          age hypertension ever_married avg_glucose_level stroke
## 1 0.03470186            0            0        0.18811136      0
## 2 0.70674487            1            1        0.15443943      0
## 3 0.09579668            0            0        0.26227427      0
## 4 0.85337243            0            1        0.06546275      0
## 5 0.16911046            0            0        0.49924755      0
## 6 0.57233627            0            1        0.73283484      0
#Number of rows
nrow(data)
## [1] 9714

Data Mining Techniques:

We conducted both supervised and unsupervised learning on our dataset, utilizing classification and clustering techniques. For classification, we employed a decision tree algorithm, which recursively constructs a tree whose leaf nodes represent the final decisions. Our goal was to predict the class label “stroke,” which has two categories: “yes” and “no.” The prediction was based on the attributes selected during feature selection, namely “ever_married,” “hypertension,” “avg_glucose_level,” and “age.”

The classification technique involved splitting the dataset into two subsets: Training dataset: Used for constructing the decision tree model. Testing dataset: Employed to assess the performance of the constructed model.

To evaluate the effectiveness of our model, we utilized a confusion matrix and measured both accuracy and cost-sensitive metrics on the dataset.

Clustering is unsupervised learning, so the class label must be removed first. We used the k-means clustering algorithm, which splits the data into groups with high intra-cluster similarity and low inter-cluster similarity: it chooses random initial centers, assigns each object to the cluster with the nearest center based on Euclidean distance, and iteratively updates the cluster centroids until they stabilize. We chose it because it is suitable for large datasets and simpler than many other algorithms. By using the same set of attributes for both clustering and classification, we can effectively compare the results and assess the similarities or differences in the patterns identified by the two techniques. This comparison provides insight into the effectiveness of each method for analyzing and categorizing the data. Therefore, we use the same attributes that were used in classification.
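The procedure just described (random initial centers, Euclidean assignment, iterative centroid updates) can be illustrated with base R's kmeans() on toy two-dimensional data (the data and k = 2 are invented for illustration):

```r
set.seed(42)
# two well-separated toy groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)  # nstart: repeat with fresh random centers
km$centers          # final centroids after the iterative updates converge
table(km$cluster)   # each toy group recovered as its own cluster
```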

## copying the data into 2 data frames, one for classification and one for clustering which will be used later
data_class <- data.frame(data)
data_cluster <- data.frame(data)

5.1 Classification:

Classification is supervised learning, therefore we need training data to train the model and test data to evaluate its performance. We tried three different partition sizes and three attribute selection measures: information gain (IG), gain ratio, and Gini index.

We always gave the training subset the largest portion of our dataset, because the model's ability to predict the class label correctly for new data depends on how the model is constructed and trained. Our model, a decision tree, needs to be fed enough data to build its rules correctly, so that when we test it on the test data it can predict the labels accurately. We used a balanced sample of our data consisting of 1000 tuples (later split into training and testing) to keep the resulting tree comprehensible:

##define formula relating stroke to the selected features (easier to reuse later on)
myFormula <- stroke ~ age + hypertension +ever_married+avg_glucose_level
data_class <- data_class %>%group_by(stroke) %>% sample_n(size=500)

Evaluation

-Gain ratio (C5.0):

Gain ratio is a metric used in decision tree algorithms to evaluate the quality of a split based on the information gain and the intrinsic information of a feature. It takes into account the entropy or impurity of a dataset and the potential information gained by splitting the data based on a specific feature.

The gain ratio is calculated by dividing the information gain by the split information. Information gain measures the reduction in entropy achieved by splitting the dataset based on a particular feature. Split information quantifies the potential information generated by the feature itself.

Using the gain ratio in a decision tree, the algorithm compares the gain ratios of different features and selects the feature with the highest ratio as the best split. This approach helps prevent bias towards features with a large number of values or categories.
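These definitions can be checked by hand on a toy split (the counts here are invented): compute the entropy before and after the split, take the difference as the information gain, and divide by the split information.

```r
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

# Toy node: 10 cases (5 stroke, 5 non-stroke) split by a binary feature
# into branches of 6 cases (5 stroke, 1 non) and 4 cases (0 stroke, 4 non).
parent     <- entropy(c(5, 5) / 10)                              # 1 bit: maximally impure
children   <- 6/10 * entropy(c(5, 1) / 6) + 4/10 * entropy(c(0, 4) / 4)
info_gain  <- parent - children                                  # reduction in entropy
split_info <- entropy(c(6, 4) / 10)                              # intrinsic information
gain_ratio <- info_gain / split_info
round(c(info_gain = info_gain, gain_ratio = gain_ratio), 3)
```

Here the gain (about 0.61 bits) is normalized by the split information (about 0.97 bits), giving a gain ratio of roughly 0.63; the normalization is what penalizes features that fragment the data into many small branches.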

To apply this, we used the C5.0 method from the “C50” package. This model extends the C4.5 classification algorithm and can take the form of a full decision tree or a collection of rules. It aims to find the feature that provides the most informative and balanced splits when constructing the tree, considering both the reduction in uncertainty and the characteristics of the feature. This helps avoid bias and makes the decisions during tree construction more robust.

1-partition the data into ( 70% training, 30% testing):

#splitting 70% of data for training, 30% of data for testing
library('C50')
library('caTools')

set.seed(1234)
sample <- sample.split(data_class$stroke, SplitRatio = 0.7)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the c5.0 gain ratio tree
strokeTree <- C5.0(myFormula, data=trainData)
summary(strokeTree)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Nov 30 14:01:00 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 700 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## age <= 0.5723363: 0 (231/21)
## age > 0.5723363:
## :...age > 0.8411534: 1 (228/34)
##     age <= 0.8411534:
##     :...hypertension > 0: 1 (49/13)
##         hypertension <= 0:
##         :...age > 0.6823069:
##             :...age <= 0.7922776: 1 (92/29)
##             :   age > 0.7922776: 0 (28/11)
##             age <= 0.6823069:
##             :...ever_married = 0: 1 (11/5)
##                 ever_married = 1:
##                 :...age <= 0.64565: 0 (31/5)
##                     age > 0.64565:
##                     :...age <= 0.657869: 1 (13/3)
##                         age > 0.657869: 0 (17/4)
## 
## 
## Evaluation on training data (700 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       9  125(17.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     266    84    (a): class 0
##      41   309    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   34.43% hypertension
##   10.29% ever_married
## 
## 
## Time: 0.0 secs
plot(strokeTree)

#make predictions using the c5.0 gain ratio tree on the test data
testPred <- predict(strokeTree, newdata = testData)  
#create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
## 79.66667
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103  14
##          1  47 136
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7466, 0.8407)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5933          
##                                           
##  Mcnemar's Test P-Value : 4.182e-05       
##                                           
##             Sensitivity : 0.9067          
##             Specificity : 0.6867          
##          Pos Pred Value : 0.7432          
##          Neg Pred Value : 0.8803          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4533          
##    Detection Prevalence : 0.6100          
##       Balanced Accuracy : 0.7967          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction   0   1
##          0 103  14
##          1  47 136
as.matrix(results)
##     0   1
## 0 103  14
## 1  47 136
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.966667e-01
## Kappa          5.933333e-01
## AccuracyLower  7.466188e-01
## AccuracyUpper  8.407475e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 2.336094e-26
## McnemarPValue  4.182134e-05
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.9066667
## Specificity          0.6866667
## Pos Pred Value       0.7431694
## Neg Pred Value       0.8803419
## Precision            0.7431694
## Recall               0.9066667
## F1                   0.8168168
## Prevalence           0.5000000
## Detection Rate       0.4533333
## Detection Prevalence 0.6100000
## Balanced Accuracy    0.7966667
print(results)  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103  14
##          1  47 136
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7466, 0.8407)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5933          
##                                           
##  Mcnemar's Test P-Value : 4.182e-05       
##                                           
##             Sensitivity : 0.9067          
##             Specificity : 0.6867          
##          Pos Pred Value : 0.7432          
##          Neg Pred Value : 0.8803          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4533          
##    Detection Prevalence : 0.6100          
##       Balanced Accuracy : 0.7967          
##                                           
##        'Positive' Class : 1               
## 

2-partition the data into ( 80% training, 20% testing):

#splitting 80% of data for training, 20% of data for testing
set.seed(1234)
sample <- sample.split(data_class$stroke, SplitRatio = 0.8)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the c5.0 gain ratio tree
strokeTree <- C5.0(myFormula, data=trainData)
summary(strokeTree)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Nov 30 12:00:36 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 800 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## age <= 0.5356794: 0 (227/13)
## age > 0.5356794:
## :...age > 0.8411534: 1 (264/43)
##     age <= 0.8411534:
##     :...hypertension > 0: 1 (56/16)
##         hypertension <= 0:
##         :...age <= 0.6823069:
##             :...ever_married = 1: 0 (101/33)
##             :   ever_married = 0:
##             :   :...age <= 0.6212121: 1 (9/3)
##             :       age > 0.6212121: 0 (4)
##             age > 0.6823069:
##             :...age <= 0.7067448: 1 (32/6)
##                 age > 0.7067448:
##                 :...avg_glucose_level > 0.1904628:
##                     :...avg_glucose_level <= 0.8173439: 1 (55/12)
##                     :   avg_glucose_level > 0.8173439: 0 (4)
##                     avg_glucose_level <= 0.1904628:
##                     :...avg_glucose_level <= 0.05723288: 0 (7)
##                         avg_glucose_level > 0.05723288:
##                         :...avg_glucose_level <= 0.1028029: 1 (12/3)
##                             avg_glucose_level > 0.1028029: 0 (29/9)
## 
## 
## Evaluation on training data (800 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      12  138(17.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     317    83    (a): class 0
##      55   345    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   38.62% hypertension
##   14.25% ever_married
##   13.38% avg_glucose_level
## 
## 
## Time: 0.0 secs
plot(strokeTree)

#make predictions using the c5.0 gain ratio tree on the test data
testPred <- predict(strokeTree, newdata = testData)
#create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##     79.5
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 66  7
##          1 34 93
##                                           
##                Accuracy : 0.795           
##                  95% CI : (0.7323, 0.8487)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.59            
##                                           
##  Mcnemar's Test P-Value : 4.896e-05       
##                                           
##             Sensitivity : 0.9300          
##             Specificity : 0.6600          
##          Pos Pred Value : 0.7323          
##          Neg Pred Value : 0.9041          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4650          
##    Detection Prevalence : 0.6350          
##       Balanced Accuracy : 0.7950          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 66  7
##          1 34 93
as.matrix(results)
##    0  1
## 0 66  7
## 1 34 93
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.950000e-01
## Kappa          5.900000e-01
## AccuracyLower  7.323461e-01
## AccuracyUpper  8.486894e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 6.670804e-18
## McnemarPValue  4.896400e-05
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.9300000
## Specificity          0.6600000
## Pos Pred Value       0.7322835
## Neg Pred Value       0.9041096
## Precision            0.7322835
## Recall               0.9300000
## F1                   0.8193833
## Prevalence           0.5000000
## Detection Rate       0.4650000
## Detection Prevalence 0.6350000
## Balanced Accuracy    0.7950000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 66  7
##          1 34 93
##                                           
##                Accuracy : 0.795           
##                  95% CI : (0.7323, 0.8487)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.59            
##                                           
##  Mcnemar's Test P-Value : 4.896e-05       
##                                           
##             Sensitivity : 0.9300          
##             Specificity : 0.6600          
##          Pos Pred Value : 0.7323          
##          Neg Pred Value : 0.9041          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4650          
##    Detection Prevalence : 0.6350          
##       Balanced Accuracy : 0.7950          
##                                           
##        'Positive' Class : 1               
## 

3-partition the data into ( 85% training, 15% testing):

#splitting 85% of data for training, 15% of data for testing
trctrl <- trainControl(method = "cv", number = 10, savePredictions=TRUE) # cross-validation control (not used below)
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.85)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the c5.0 gain ratio tree
strokeTree <- C5.0(myFormula, data=trainData)
summary(strokeTree)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Nov 30 12:00:36 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 850 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## age <= 0.5845552:
## :...age <= 0.3768328: 0 (157)
## :   age > 0.3768328:
## :   :...age <= 0.5356794:
## :       :...age > 0.3890518: 0 (76/6)
## :       :   age <= 0.3890518:
## :       :   :...avg_glucose_level <= 0.1020504: 1 (4)
## :       :       avg_glucose_level > 0.1020504: 0 (6)
## :       age > 0.5356794:
## :       :...avg_glucose_level > 0.1939898: 0 (19)
## :           avg_glucose_level <= 0.1939898:
## :           :...ever_married = 0: 1 (3)
## :               ever_married = 1:
## :               :...avg_glucose_level > 0.1481377: 1 (8/1)
## :                   avg_glucose_level <= 0.1481377:
## :                   :...avg_glucose_level <= 0.04270128: 1 (4/1)
## :                       avg_glucose_level > 0.04270128: 0 (10)
## age > 0.5845552:
## :...hypertension > 0: 1 (135/23)
##     hypertension <= 0:
##     :...age > 0.8900293:
##         :...age <= 0.9755621: 1 (126/12)
##         :   age > 0.9755621:
##         :   :...avg_glucose_level > 0.7749718: 0 (4)
##         :       avg_glucose_level <= 0.7749718:
##         :       :...ever_married = 0: 0 (3/1)
##         :           ever_married = 1:
##         :           :...avg_glucose_level <= 0.0162246: 0 (2)
##         :               avg_glucose_level > 0.0162246: 1 (32/4)
##         age <= 0.8900293:
##         :...age <= 0.6823069:
##             :...avg_glucose_level <= 0.4466704: 0 (51/13)
##             :   avg_glucose_level > 0.4466704:
##             :   :...avg_glucose_level <= 0.6734857: 1 (11/1)
##             :       avg_glucose_level > 0.6734857: 0 (5)
##             age > 0.6823069:
##             :...avg_glucose_level > 0.2647197: 1 (76/17)
##                 avg_glucose_level <= 0.2647197:
##                 :...age <= 0.7311828: 1 (47/14)
##                     age > 0.7311828:
##                     :...age <= 0.8289345: 0 (42/14)
##                         age > 0.8289345:
##                         :...age <= 0.8655914:
##                             :...avg_glucose_level <= 0.1500658: 1 (9)
##                             :   avg_glucose_level > 0.1500658:
##                             :   :...avg_glucose_level <= 0.1805869: 0 (3)
##                             :       avg_glucose_level > 0.1805869: 1 (7/2)
##                             age > 0.8655914:
##                             :...avg_glucose_level <= 0.07138826: 0 (3)
##                                 avg_glucose_level > 0.07138826:
##                                 :...avg_glucose_level <= 0.2042419: 1 (4)
##                                     avg_glucose_level > 0.2042419: 0 (3)
## 
## 
## Evaluation on training data (850 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      27  109(12.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     350    75    (a): class 0
##      34   391    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% age
##   66.24% hypertension
##   41.88% avg_glucose_level
##    7.29% ever_married
## 
## 
## Time: 0.0 secs
plot(strokeTree)

#make predictions using the c5.0 gain ratio tree on the test data
testPred <- predict(strokeTree, newdata = testData)
#create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##       78
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 53 11
##          1 22 64
##                                           
##                Accuracy : 0.78            
##                  95% CI : (0.7051, 0.8435)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.603e-12       
##                                           
##                   Kappa : 0.56            
##                                           
##  Mcnemar's Test P-Value : 0.08172         
##                                           
##             Sensitivity : 0.8533          
##             Specificity : 0.7067          
##          Pos Pred Value : 0.7442          
##          Neg Pred Value : 0.8281          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4267          
##    Detection Prevalence : 0.5733          
##       Balanced Accuracy : 0.7800          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 53 11
##          1 22 64
as.matrix(results)
##    0  1
## 0 53 11
## 1 22 64
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.800000e-01
## Kappa          5.600000e-01
## AccuracyLower  7.051459e-01
## AccuracyUpper  8.434627e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 1.603201e-12
## McnemarPValue  8.172275e-02
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.8533333
## Specificity          0.7066667
## Pos Pred Value       0.7441860
## Neg Pred Value       0.8281250
## Precision            0.7441860
## Recall               0.8533333
## F1                   0.7950311
## Prevalence           0.5000000
## Detection Rate       0.4266667
## Detection Prevalence 0.5733333
## Balanced Accuracy    0.7800000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 53 11
##          1 22 64
##                                           
##                Accuracy : 0.78            
##                  95% CI : (0.7051, 0.8435)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.603e-12       
##                                           
##                   Kappa : 0.56            
##                                           
##  Mcnemar's Test P-Value : 0.08172         
##                                           
##             Sensitivity : 0.8533          
##             Specificity : 0.7067          
##          Pos Pred Value : 0.7442          
##          Neg Pred Value : 0.8281          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4267          
##    Detection Prevalence : 0.5733          
##       Balanced Accuracy : 0.7800          
##                                           
##        'Positive' Class : 1               
## 

Now let's compare the results of the different partitions for GAIN RATIO:

Metric        70% train / 30% test   80% train / 20% test   85% train / 15% test
Accuracy      0.7967                 0.795                  0.78
Precision     0.7432                 0.7323                 0.7442
Sensitivity   0.9067                 0.9300                 0.8533
Specificity   0.6867                 0.6600                 0.7067
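As a cross-check, these figures can be derived directly from the three confusion matrices reported above:

```r
# Recompute the comparison metrics from a confusion matrix (positive class = 1).
metrics <- function(tp, fn, fp, tn) c(
  accuracy    = (tp + tn) / (tp + fn + fp + tn),
  precision   = tp / (tp + fp),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp))

round(rbind(
  "70/30" = metrics(tp = 136, fn = 14, fp = 47, tn = 103),
  "80/20" = metrics(tp =  93, fn =  7, fp = 34, tn =  66),
  "85/15" = metrics(tp =  64, fn = 11, fp = 22, tn =  53)), 4)
```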

Among these partitioning ratios, the model trained on the 70% training set and 30% testing set achieved the highest accuracy (0.7967), followed by the model trained on the 80% training set and 20% testing set (0.795), and the model trained on the 85% training set and 15% testing set (0.78).

In terms of precision, the model trained on the 85% training set and 15% testing set achieved the highest precision (0.7442), followed by the model trained on the 70% training set and 30% testing set (0.7432), and the model trained on the 80% training set and 20% testing set (0.7323).

For sensitivity, the model trained on the 80% training set and 20% testing set had the highest sensitivity (0.9300), followed by the model trained on the 70% training set and 30% testing set (0.9067), and the model trained on the 85% training set and 15% testing set (0.8533).

In terms of specificity, the model trained on the 85% training set and 15% testing set achieved the highest specificity (0.7067), followed by the model trained on the 70% training set and 30% testing set (0.6867), and the model trained on the 80% training set and 20% testing set (0.6600).

Based on these results, the model trained on the 70% training set and 30% testing set appears to be the best partitioning ratio overall: it achieved the highest accuracy and near-best precision, with sensitivity and specificity close to the best values. The 80/20 split achieved the highest sensitivity but the lowest precision and specificity, while the 85/15 split achieved the highest precision and specificity but a considerably lower sensitivity than the others.

Overall, the results obtained from the decision tree using the gain ratio as the splitting criterion are realistic and promising. The accuracy values are reasonably high, indicating that the model can make correct predictions on the testing data. The precision, sensitivity, and specificity measures also demonstrate a good balance between correctly identifying the positive instances, correctly identifying the negative instances, and avoiding false positives and false negatives.

-Gini index (rpart):

Gini index is a measure of impurity or the degree of disorder in a dataset. It is commonly used in decision tree algorithms to evaluate the quality of a split when constructing the tree.

When building a decision tree, each potential split is assessed using the Gini index. The Gini index calculates the probability of misclassifying a randomly chosen element from a dataset if it were randomly labeled according to the distribution of classes in that subset. A lower Gini index indicates a more pure or homogeneous subset, meaning that the classes within that subset are similar.

To use the Gini index in a decision tree, the algorithm considers various potential splits based on different features in the dataset. It calculates the Gini index for each split and selects the one with the lowest value. The chosen split results in the highest possible purity or homogeneity of the resulting subsets.
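A worked toy example (the counts are invented) shows how such a split is scored: the weighted Gini impurity of the children is subtracted from the parent's impurity, and the candidate split with the largest reduction wins.

```r
gini <- function(p) 1 - sum(p^2)

# Toy node: 10 cases (5 stroke, 5 non) split into branches of 6 (5/1) and 4 (0/4).
parent    <- gini(c(5, 5) / 10)                                  # 0.5: maximally impure
children  <- 6/10 * gini(c(5, 1) / 6) + 4/10 * gini(c(0, 4) / 4) # right branch is pure
reduction <- parent - children    # larger reduction = better split
round(c(parent = parent, children = children, reduction = reduction), 3)
```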

For the Gini index we used rpart, a powerful machine learning library in R for building classification and regression trees. This library implements recursive partitioning and is very easy to use. rpart provides many useful methods for building our model, for example rpart() for constructing the model and rpart.plot() (from the companion rpart.plot package) for plotting the tree.

1-partition the data into ( 70% training, 30% testing):

#splitting 70% of data for training, 30% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.7)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)

#make predictions using the rpart gini index tree on the test data
testPred <- predict(tree, newdata = testData,type = 'class')
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##       74
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  90  18
##          1  60 132
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6865, 0.7887)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.48            
##                                           
##  Mcnemar's Test P-Value : 3.445e-06       
##                                           
##             Sensitivity : 0.8800          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4400          
##    Detection Prevalence : 0.6400          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction   0   1
##          0  90  18
##          1  60 132
as.matrix(results)
##    0   1
## 0 90  18
## 1 60 132
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.400000e-01
## Kappa          4.800000e-01
## AccuracyLower  6.864771e-01
## AccuracyUpper  7.887132e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 1.812272e-17
## McnemarPValue  3.444923e-06
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.8800000
## Specificity          0.6000000
## Pos Pred Value       0.6875000
## Neg Pred Value       0.8333333
## Precision            0.6875000
## Recall               0.8800000
## F1                   0.7719298
## Prevalence           0.5000000
## Detection Rate       0.4400000
## Detection Prevalence 0.6400000
## Balanced Accuracy    0.7400000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  90  18
##          1  60 132
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6865, 0.7887)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.48            
##                                           
##  Mcnemar's Test P-Value : 3.445e-06       
##                                           
##             Sensitivity : 0.8800          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4400          
##    Detection Prevalence : 0.6400          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : 1               
## 

2-partition the data into ( 80% training, 20% testing):

#splitting 80% of data for training, 20% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.8)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)

#make predictions using the rpart gini index tree on the test data
testPred <- predict(tree, newdata = testData,type = 'class')
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##       78
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 70 14
##          1 30 86
##                                           
##                Accuracy : 0.78            
##                  95% CI : (0.7161, 0.8354)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 3.419e-16       
##                                           
##                   Kappa : 0.56            
##                                           
##  Mcnemar's Test P-Value : 0.02374         
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.7000          
##          Pos Pred Value : 0.7414          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4300          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7800          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 70 14
##          1 30 86
as.matrix(results)
##    0  1
## 0 70 14
## 1 30 86
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.800000e-01
## Kappa          5.600000e-01
## AccuracyLower  7.161388e-01
## AccuracyUpper  8.353639e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 3.418956e-16
## McnemarPValue  2.373852e-02
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.8600000
## Specificity          0.7000000
## Pos Pred Value       0.7413793
## Neg Pred Value       0.8333333
## Precision            0.7413793
## Recall               0.8600000
## F1                   0.7962963
## Prevalence           0.5000000
## Detection Rate       0.4300000
## Detection Prevalence 0.5800000
## Balanced Accuracy    0.7800000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 70 14
##          1 30 86
##                                           
##                Accuracy : 0.78            
##                  95% CI : (0.7161, 0.8354)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 3.419e-16       
##                                           
##                   Kappa : 0.56            
##                                           
##  Mcnemar's Test P-Value : 0.02374         
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.7000          
##          Pos Pred Value : 0.7414          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4300          
##    Detection Prevalence : 0.5800          
##       Balanced Accuracy : 0.7800          
##                                           
##        'Positive' Class : 1               
## 

3- Partition the data into (85% training, 15% testing):

#splitting 85% of data for training, 15% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.85)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)

#make predictions using the rpart gini index tree on the test data
testPred <- predict(tree, newdata = testData,type = 'class')
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##       74
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 13
##          1 26 62
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6621, 0.8081)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.692e-09       
##                                           
##                   Kappa : 0.48            
##                                           
##  Mcnemar's Test P-Value : 0.05466         
##                                           
##             Sensitivity : 0.8267          
##             Specificity : 0.6533          
##          Pos Pred Value : 0.7045          
##          Neg Pred Value : 0.7903          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4133          
##    Detection Prevalence : 0.5867          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 49 13
##          1 26 62
as.matrix(results)
##    0  1
## 0 49 13
## 1 26 62
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.400000e-01
## Kappa          4.800000e-01
## AccuracyLower  6.621433e-01
## AccuracyUpper  8.081242e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 1.692319e-09
## McnemarPValue  5.466394e-02
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.8266667
## Specificity          0.6533333
## Pos Pred Value       0.7045455
## Neg Pred Value       0.7903226
## Precision            0.7045455
## Recall               0.8266667
## F1                   0.7607362
## Prevalence           0.5000000
## Detection Rate       0.4133333
## Detection Prevalence 0.5866667
## Balanced Accuracy    0.7400000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 13
##          1 26 62
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6621, 0.8081)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.692e-09       
##                                           
##                   Kappa : 0.48            
##                                           
##  Mcnemar's Test P-Value : 0.05466         
##                                           
##             Sensitivity : 0.8267          
##             Specificity : 0.6533          
##          Pos Pred Value : 0.7045          
##          Neg Pred Value : 0.7903          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4133          
##    Detection Prevalence : 0.5867          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : 1               
## 

Now let’s compare the results of the different partitions for the Gini index tree:

             70% train / 30% test   80% train / 20% test   85% train / 15% test
Accuracy             0.74                   0.78                   0.74
Precision            0.6875                 0.7414                 0.7045
Sensitivity          0.8800                 0.8600                 0.8267
Specificity          0.6000                 0.7000                 0.6533

Looking at the accuracy values, the model trained on the 80% training set and tested on the 20% testing set achieved the highest accuracy (0.78). Both the models trained on the 85% training set and 15% testing set and the 70% training set and 30% testing set had the same accuracy (0.74).

In terms of precision, which measures the proportion of correctly predicted positive instances, the model trained on the 80% training set and tested on the 20% testing set also obtained the highest precision (0.7414), followed by the model trained on the 85% training set and 15% testing set (0.7045), and the model trained on the 70% training set and 30% testing set (0.6875).

For sensitivity, which represents the ability to correctly identify positive instances, the model trained on the 70% training set and 30% testing set had the highest sensitivity (0.8800), followed by the models trained on the 80% training set and 20% testing set with a sensitivity of (0.8600), and the 85% training set and 15% testing set with a sensitivity of (0.8267).

In terms of specificity, which measures the ability to correctly identify negative instances, the model trained on the 80% training set and 20% testing set achieved the highest specificity (0.7000), followed by the model trained on the 85% training set and 15% testing set (0.6533), and the model trained on the 70% training set and 30% testing set (0.6000).

Based on these results, the model trained on the 80% training set and 20% testing set appears to be the best partitioning ratio: it achieved the highest accuracy, precision, and specificity, and its sensitivity (0.8600) was only slightly below the best value. It should also be noted that although the 85%/15% and 70%/30% splits achieved the same accuracy (0.74), the 85%/15% split is preferable because of its higher precision and specificity, whereas the 70%/30% split traded those for a higher sensitivity.

-Information gain (ctree):

Information gain is a measure used in decision tree algorithms to evaluate the usefulness of a feature in splitting the data. It quantifies the reduction in entropy or impurity achieved by splitting the dataset based on that feature.

Entropy is a measure of the disorder or randomness in a dataset, specifically the uncertainty of class labels. Information gain calculates the difference between the entropy of the parent node (before the split) and the weighted average of the entropies of the child nodes (after the split).

To use information gain in a decision tree, the algorithm examines different features and calculates the information gain for each one. The feature with the highest information gain is selected as the best choice for splitting the data.

By selecting features with high information gain, the decision tree algorithm aims to create subsets that are more homogeneous or pure in terms of the class labels. This allows for better classification or prediction within each subset.
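The computation described above can be sketched in a few lines of R. This is an illustrative, self-contained example; the helper names `entropy` and `info_gain` are ours and not part of any package:

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))   # class proportions
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a discrete `feature`:
# parent entropy minus the weighted average entropy of the child nodes
info_gain <- function(labels, feature) {
  children <- split(labels, feature)
  weighted <- sum(sapply(children,
                         function(s) length(s) / length(labels) * entropy(s)))
  entropy(labels) - weighted
}

# Toy check: a feature that separates the classes perfectly yields
# a gain of 1 bit; an uninformative feature yields a gain of 0
info_gain(c(0, 0, 1, 1), c("a", "a", "b", "b"))  # 1
info_gain(c(0, 1, 0, 1), c("a", "a", "b", "b"))  # 0
```

The decision tree algorithm evaluates `info_gain` for every candidate feature and splits on the one with the largest value.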

To apply this, we have used the ctree method from the “party” package. This method requires continuous-valued attributes to be discretized, so we discretize age and avg_glucose_level. It also does not accept character values, so we convert hypertension and ever_married to factors:

set.seed(123)
library('MASS')
library("discretization")
# candidate cut points for age, based on the stroke class label
cutPoints(data_class$age, data_class$stroke)
## [1] 0.3829423 0.5784457 0.8472630
data_class$age <- cut(data_class$age, breaks = seq(0, 1, by = 0.2), right = TRUE)

# candidate cut point for avg_glucose_level
cutPoints(data_class$avg_glucose_level, data_class$stroke)
## [1] 0.5006114
data_class$avg_glucose_level <- cut(data_class$avg_glucose_level, breaks = seq(0, 1, by = 0.5), right = TRUE)

data_class$hypertension <- as.factor(data_class$hypertension)
data_class$ever_married <- as.factor(data_class$ever_married)

1- Partition the data into (70% training, 30% testing):

#splitting 70% of data for training, 30% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.7)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the ctree information gain tree
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:modeltools':
## 
##     empty
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
library(readr)
stroke_ctree <- ctree(myFormula, data=trainData)
print(stroke_ctree)
## 
##   Conditional inference tree with 6 terminal nodes
## 
## Response:  stroke 
## Inputs:  age, hypertension, ever_married, avg_glucose_level 
## Number of observations:  700 
## 
## 1) age == {(0,0.2], (0.2,0.4], (0.4,0.6]}; criterion = 1, statistic = 271.411
##   2) age == {(0,0.2], (0.2,0.4]}; criterion = 0.999, statistic = 17.546
##     3) ever_married == {1}; criterion = 0.998, statistic = 11.699
##       4)*  weights = 36 
##     3) ever_married == {0}
##       5)*  weights = 103 
##   2) age == {(0.4,0.6]}
##     6)*  weights = 109 
## 1) age == {(0.6,0.8], (0.8,1]}
##   7) age == {(0.8,1]}; criterion = 1, statistic = 19.669
##     8)*  weights = 275 
##   7) age == {(0.6,0.8]}
##     9) avg_glucose_level == {(0,0.5]}; criterion = 0.99, statistic = 9.09
##       10)*  weights = 121 
##     9) avg_glucose_level == {(0.5,1]}
##       11)*  weights = 56
plot(stroke_ctree,type="simple")

plot(stroke_ctree)

#make predictions using the ctree information gain tree on the test data
testPred <- predict(stroke_ctree, newdata = testData)
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
## 74.33333
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  84  11
##          1  66 139
##                                         
##                Accuracy : 0.7433        
##                  95% CI : (0.69, 0.7918)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4867        
##                                         
##  Mcnemar's Test P-Value : 7.561e-10     
##                                         
##             Sensitivity : 0.9267        
##             Specificity : 0.5600        
##          Pos Pred Value : 0.6780        
##          Neg Pred Value : 0.8842        
##              Prevalence : 0.5000        
##          Detection Rate : 0.4633        
##    Detection Prevalence : 0.6833        
##       Balanced Accuracy : 0.7433        
##                                         
##        'Positive' Class : 1             
## 
as.table(results)
##           Reference
## Prediction   0   1
##          0  84  11
##          1  66 139
as.matrix(results)
##    0   1
## 0 84  11
## 1 66 139
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.433333e-01
## Kappa          4.866667e-01
## AccuracyLower  6.899830e-01
## AccuracyUpper  7.918063e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 6.281983e-18
## McnemarPValue  7.561413e-10
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.9266667
## Specificity          0.5600000
## Pos Pred Value       0.6780488
## Neg Pred Value       0.8842105
## Precision            0.6780488
## Recall               0.9266667
## F1                   0.7830986
## Prevalence           0.5000000
## Detection Rate       0.4633333
## Detection Prevalence 0.6833333
## Balanced Accuracy    0.7433333
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  84  11
##          1  66 139
##                                         
##                Accuracy : 0.7433        
##                  95% CI : (0.69, 0.7918)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4867        
##                                         
##  Mcnemar's Test P-Value : 7.561e-10     
##                                         
##             Sensitivity : 0.9267        
##             Specificity : 0.5600        
##          Pos Pred Value : 0.6780        
##          Neg Pred Value : 0.8842        
##              Prevalence : 0.5000        
##          Detection Rate : 0.4633        
##    Detection Prevalence : 0.6833        
##       Balanced Accuracy : 0.7433        
##                                         
##        'Positive' Class : 1             
## 

2- Partition the data into (80% training, 20% testing):

#splitting 80% of data for training, 20% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.8)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the ctree information gain tree
library(party)
library(plyr)
library(readr)
stroke_ctree <- ctree(myFormula, data=trainData)
print(stroke_ctree)
## 
##   Conditional inference tree with 7 terminal nodes
## 
## Response:  stroke 
## Inputs:  age, hypertension, ever_married, avg_glucose_level 
## Number of observations:  800 
## 
## 1) age == {(0,0.2], (0.2,0.4], (0.4,0.6]}; criterion = 1, statistic = 296.302
##   2) age == {(0,0.2], (0.2,0.4]}; criterion = 1, statistic = 21.262
##     3) ever_married == {0}; criterion = 0.997, statistic = 11.245
##       4)*  weights = 113 
##     3) ever_married == {1}
##       5)*  weights = 41 
##   2) age == {(0.4,0.6]}
##     6)*  weights = 121 
## 1) age == {(0.6,0.8], (0.8,1]}
##   7) age == {(0.6,0.8]}; criterion = 1, statistic = 22.418
##     8) avg_glucose_level == {(0,0.5]}; criterion = 0.993, statistic = 9.808
##       9)*  weights = 145 
##     8) avg_glucose_level == {(0.5,1]}
##       10)*  weights = 62 
##   7) age == {(0.8,1]}
##     11) hypertension == {0}; criterion = 0.961, statistic = 6.653
##       12)*  weights = 232 
##     11) hypertension == {1}
##       13)*  weights = 86
plot(stroke_ctree,type="simple")

plot(stroke_ctree)

#make predictions using the ctree information gain tree on the test data
testPred <- predict(stroke_ctree, newdata = testData)
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
##       76
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 60  8
##          1 40 92
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6947, 0.8174)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 4.369e-14       
##                                           
##                   Kappa : 0.52            
##                                           
##  Mcnemar's Test P-Value : 7.660e-06       
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.6970          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.6600          
##       Balanced Accuracy : 0.7600          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 60  8
##          1 40 92
as.matrix(results)
##    0  1
## 0 60  8
## 1 40 92
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.600000e-01
## Kappa          5.200000e-01
## AccuracyLower  6.946937e-01
## AccuracyUpper  8.174281e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 4.369218e-14
## McnemarPValue  7.660302e-06
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.9200000
## Specificity          0.6000000
## Pos Pred Value       0.6969697
## Neg Pred Value       0.8823529
## Precision            0.6969697
## Recall               0.9200000
## F1                   0.7931034
## Prevalence           0.5000000
## Detection Rate       0.4600000
## Detection Prevalence 0.6600000
## Balanced Accuracy    0.7600000
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 60  8
##          1 40 92
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6947, 0.8174)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 4.369e-14       
##                                           
##                   Kappa : 0.52            
##                                           
##  Mcnemar's Test P-Value : 7.660e-06       
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.6970          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.6600          
##       Balanced Accuracy : 0.7600          
##                                           
##        'Positive' Class : 1               
## 

3- Partition the data into (85% training, 15% testing):

#splitting 85% of data for training, 15% of data for testing
set.seed(123)
sample <- sample.split(data_class$stroke, SplitRatio = 0.85)
trainData  <- subset(data_class, sample == TRUE)
testData   <- subset(data_class, sample == FALSE)

#train using the trainData and create the ctree information gain tree
library(party)
library(plyr)
library(readr)
stroke_ctree <- ctree(myFormula, data=trainData)
print(stroke_ctree)
## 
##   Conditional inference tree with 6 terminal nodes
## 
## Response:  stroke 
## Inputs:  age, hypertension, ever_married, avg_glucose_level 
## Number of observations:  850 
## 
## 1) age == {(0,0.2], (0.2,0.4], (0.4,0.6]}; criterion = 1, statistic = 330.525
##   2) age == {(0,0.2], (0.2,0.4]}; criterion = 1, statistic = 21.56
##     3) ever_married == {0}; criterion = 0.997, statistic = 11.388
##       4)*  weights = 123 
##     3) ever_married == {1}
##       5)*  weights = 44 
##   2) age == {(0.4,0.6]}
##     6)*  weights = 129 
## 1) age == {(0.6,0.8], (0.8,1]}
##   7) age == {(0.6,0.8]}; criterion = 1, statistic = 24.892
##     8) avg_glucose_level == {(0,0.5]}; criterion = 0.991, statistic = 9.373
##       9)*  weights = 149 
##     8) avg_glucose_level == {(0.5,1]}
##       10)*  weights = 65 
##   7) age == {(0.8,1]}
##     11)*  weights = 340
plot(stroke_ctree,type="simple")

plot(stroke_ctree)

#make predictions using the ctree information gain tree on the test data
testPred <- predict(stroke_ctree, newdata = testData)
##create confusion matrix to evaluate the model's performance
library(caret)
results <- confusionMatrix(testPred, testData$stroke, positive= "1")
acc <- results$overall["Accuracy"]*100
acc
## Accuracy 
## 70.66667
results
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39  8
##          1 36 67
##                                           
##                Accuracy : 0.7067          
##                  95% CI : (0.6269, 0.7781)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 2.200e-07       
##                                           
##                   Kappa : 0.4133          
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.8933          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.6505          
##          Neg Pred Value : 0.8298          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.6867          
##       Balanced Accuracy : 0.7067          
##                                           
##        'Positive' Class : 1               
## 
as.table(results)
##           Reference
## Prediction  0  1
##          0 39  8
##          1 36 67
as.matrix(results)
##    0  1
## 0 39  8
## 1 36 67
as.matrix(results, what = "overall")
##                        [,1]
## Accuracy       7.066667e-01
## Kappa          4.133333e-01
## AccuracyLower  6.268758e-01
## AccuracyUpper  7.780937e-01
## AccuracyNull   5.000000e-01
## AccuracyPValue 2.199726e-07
## McnemarPValue  4.693185e-05
as.matrix(results, what = "classes")
##                           [,1]
## Sensitivity          0.8933333
## Specificity          0.5200000
## Pos Pred Value       0.6504854
## Neg Pred Value       0.8297872
## Precision            0.6504854
## Recall               0.8933333
## F1                   0.7528090
## Prevalence           0.5000000
## Detection Rate       0.4466667
## Detection Prevalence 0.6866667
## Balanced Accuracy    0.7066667
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39  8
##          1 36 67
##                                           
##                Accuracy : 0.7067          
##                  95% CI : (0.6269, 0.7781)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 2.200e-07       
##                                           
##                   Kappa : 0.4133          
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.8933          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.6505          
##          Neg Pred Value : 0.8298          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.6867          
##       Balanced Accuracy : 0.7067          
##                                           
##        'Positive' Class : 1               
## 

Now let’s compare the results of the different partitions for the information gain (ctree) tree:

             70% train / 30% test   80% train / 20% test   85% train / 15% test
Accuracy             0.7433                 0.76                   0.7067
Precision            0.6780                 0.6970                 0.6505
Sensitivity          0.9267                 0.9200                 0.8933
Specificity          0.5600                 0.6000                 0.5200

Among these partitioning ratios, the model trained on the 80% training set and 20% testing set achieved the highest accuracy (0.76), followed by the model trained on the 70% training set and 30% testing set (0.7433), and then the model trained on the 85% training set and 15% testing set (0.7067).

In terms of precision, the model trained on the 80% training set and 20% testing set obtained the highest precision (0.6970), followed by the model trained on the 70% training set and 30% testing set (0.6780), and the model trained on the 85% training set and 15% testing set (0.6505).

For sensitivity, the model trained on the 70% training set and 30% testing set achieved the highest sensitivity (0.9267), followed by the models trained on the 80% training set and 20% testing set (0.9200), and the 85% training set and 15% testing set (0.8933).

In terms of specificity, the model trained on the 80% training set and 20% testing set obtained the highest specificity (0.6000), followed by the model trained on the 70% training set and 30% testing set (0.5600), and the model trained on the 85% training set and 15% testing set (0.5200).

Based on these results, the model trained on the 80% training set and 20% testing set showed the best overall performance: it achieved the highest accuracy, precision, and specificity, and its sensitivity (0.9200) was close to the highest value. These measures indicate that the model can make reasonably accurate predictions.

Overall, the results obtained from the decision tree using information gain as the splitting criterion are reasonable: compared with the Gini index tree at the same 80%/20% split, it achieves a higher sensitivity (0.9200 vs. 0.8600) at the cost of lower accuracy and specificity.

5.2 Clustering:

The clustering process involves grouping similar data points together based on their inherent characteristics or similarities. One popular clustering algorithm is k-means, which partitions the dataset into k clusters, where each data point belongs to the cluster with the nearest mean value based on Euclidean distance.

By applying the k-means algorithm to the Stroke Prediction Dataset, after removing the stroke class label and converting the attributes to numeric format, we can effectively perform clustering analysis to identify patterns and groupings within the data. This process aids in uncovering valuable insights and potentially discovering hidden relationships among the variables, which can further enhance our understanding of stroke risk factors and contribute to better healthcare strategies.
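The nearest-mean assignment rule mentioned above can be sketched directly in R. This is a hedged, illustrative snippet on toy centroids; the helper `assign_cluster` is ours, not part of the k-means implementation used below:

```r
# Assign a single point to the cluster whose centroid is nearest
# in Euclidean distance -- the core assignment step of k-means
assign_cluster <- function(x, centers) {
  d <- apply(centers, 1, function(centroid) sqrt(sum((x - centroid)^2)))
  which.min(d)
}

centers <- rbind(c(0, 0), c(5, 5))   # two toy centroids
assign_cluster(c(1, 1), centers)     # 1: closer to (0, 0)
assign_cluster(c(4, 6), centers)     # 2: closer to (5, 5)
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.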

We can implement clustering by following a few steps.

First, we remove the class label “stroke” from the dataset, since clustering is an unsupervised learning task and does not require labeled data. In addition, we remove certain attributes (heart_disease, work_type, smoking_status, and bmi) and keep the top four attributes (age, hypertension, ever_married, avg_glucose_level) identified earlier through feature selection.

library(factoextra)     
sr <- data_cluster$stroke   # set the class label aside for later external evaluation
data_cluster <- data_cluster[,!names(data_cluster) %in% c("stroke","id","Residence_type","gender","heart_disease","bmi","work_type","smoking_status")]

head(data_cluster)
##          age hypertension ever_married avg_glucose_level
## 1 0.03470186            0            0        0.18811136
## 2 0.70674487            1            1        0.15443943
## 3 0.09579668            0            0        0.26227427
## 4 0.85337243            0            1        0.06546275
## 5 0.16911046            0            0        0.49924755
## 6 0.57233627            0            1        0.73283484
str(data_cluster)
## 'data.frame':    9714 obs. of  4 variables:
##  $ age              : num  0.0347 0.7067 0.0958 0.8534 0.1691 ...
##  $ hypertension     : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ ever_married     : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 2 2 2 ...
##  $ avg_glucose_level: num  0.1881 0.1544 0.2623 0.0655 0.4992 ...

Then we convert all the remaining attributes in the dataset to numeric format. This step ensures compatibility with the k-means algorithm, which operates on numerical data.

# Converting integer & factor columns to numeric
# (note: converting a factor yields its level codes, 1/2, as seen below)

data_cluster$age <- as.numeric(data_cluster$age)
data_cluster$hypertension <- as.numeric(data_cluster$hypertension)
data_cluster$avg_glucose_level <- as.numeric(data_cluster$avg_glucose_level)
data_cluster$ever_married <- as.numeric(data_cluster$ever_married)
# for convenience, keep a copy containing only the four selected attributes
data_no_color <- data_cluster[1:4]

#Let us see the structure again
str(data_no_color)
## 'data.frame':    9714 obs. of  4 variables:
##  $ age              : num  0.0347 0.7067 0.0958 0.8534 0.1691 ...
##  $ hypertension     : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ ever_married     : num  1 2 1 2 1 2 2 2 2 2 ...
##  $ avg_glucose_level: num  0.1881 0.1544 0.2623 0.0655 0.4992 ...

Once the preprocessing is complete, we can proceed with implementing the k-means clustering algorithm. We start by selecting a suitable value for k, which represents the number of clusters we want to identify in the data. To determine the optimal choice of k, we can try different values, such as 2, 3, or 5, and evaluate the clustering results for each.

After running the k-means algorithm with different values of k, we can assess the quality of the clustering using various evaluation metrics, such as within-cluster sum of squares, silhouette analysis, and BCubed precision and recall. These metrics help us determine the optimal number of clusters that provide the most meaningful and well-separated groups.

-The average silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1; a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

-The WSS (Within-Cluster Sum of Squares) is a metric that measures the compactness of clusters.

-BCubed is an evaluation metric used in information retrieval and clustering tasks to assess the quality of clustering results. It measures precision and recall at the level of individual items.
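For reference, the first two of these metrics have standard closed forms. For an observation i, a(i) is its mean distance to the other points in its own cluster and b(i) its mean distance to the points of the nearest other cluster; WSS sums squared distances to each cluster centroid:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1

\mathrm{WSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2}
```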

We therefore run k-means with k = 2, 3, and 5 and compare the results.
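As a compact way to run these comparisons, the three fits can be sketched in one loop. This is only a sketch: it assumes a data_cluster data frame like the one prepared above (a small synthetic stand-in is generated when none exists), and it reports only WSS and the average silhouette.

```r
# Sketch: fit k-means for each candidate k and report WSS and average silhouette
library(cluster)

# Stand-in data for illustration only (the report uses its preprocessed data_cluster)
if (!exists("data_cluster")) {
  set.seed(1)
  data_cluster <- data.frame(age = runif(200),
                             hypertension = rbinom(200, 1, 0.1),
                             ever_married = sample(1:2, 200, replace = TRUE),
                             avg_glucose_level = runif(200))
}

set.seed(123)  # k-means results depend on random initialisation
for (k in c(2, 3, 5)) {
  km  <- kmeans(data_cluster, k, iter.max = 140, algorithm = "Lloyd", nstart = 100)
  sil <- silhouette(km$cluster, dist(data_cluster))
  cat(sprintf("k = %d | WSS = %.1f | avg silhouette = %.2f\n",
              k, km$tot.withinss, mean(sil[, "sil_width"])))
}
```

Setting a seed before the loop keeps the comparison reproducible across runs.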

- 2 Clusters:

#######cluster k=2

#calculate k-mean k=2
km <- kmeans(data_cluster, 2, iter.max = 140 , algorithm="Lloyd", nstart=100)



#plot k-mean 
fviz_cluster(list(data = data_cluster, cluster = km$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

The analysis of the clustering results reveals several significant findings. Firstly, the observation of two non-overlapping clusters in the image indicates the successful separation of data points into distinct groups by the employed clustering algorithm. This outcome is indicative of favorable grouping results, as it demonstrates the algorithm’s ability to identify and delineate cohesive clusters within the dataset.

Furthermore, the considerable distance observed between the two clusters signifies a clear separation in the feature space. This characteristic is highly desirable in clustering tasks, as it suggests a substantial dissimilarity between the data points belonging to different clusters. The notable distance between clusters enhances the interpretability of the results and indicates the presence of distinctive patterns or characteristics within each group.

Additionally, the nearly identical sizes of the two clusters are a notable observation. The presence of similar cluster sizes is advantageous in clustering analysis as it implies a balanced distribution of data points across the clusters. This balance facilitates a fair representation of the underlying data and simplifies the interpretation of the unique characteristics exhibited by each cluster.

Average silhouette:

#avg silhouette
library(cluster)
sil <- silhouette(km$cluster, dist(data_cluster))
rownames(sil) <- rownames(data_cluster)
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 2278          0.58
## 2       2 7436          0.52

The two clusters differ clearly in size (2278 versus 7436 observations), suggesting the presence of characteristic attributes that differentiate them from each other. Such differentiation is indicative of reasonable clustering outcomes, as it implies the identification of distinct subgroups within the dataset based on the selected attributes. Moreover, the average silhouette widths of 0.58 and 0.52 offer further evaluation of the clustering quality.

A silhouette width above 0.5 is generally considered indicative of a reasonable degree of separation between data points within their assigned clusters. In this analysis, the observed silhouette widths surpass this threshold, suggesting a satisfactory level of discrimination and cohesion within each cluster. Consequently, the clustering algorithm has likely achieved a suitable level of within-cluster similarity and between-cluster dissimilarity, reinforcing the reliability of the clustering results.

Total within-cluster sum of square:

# Total sum of squares
km$tot.withinss
## [1] 2500.985

The total within-cluster sum of squares (WSS) serves as a metric for assessing the compactness of the clusters. However, the absolute WSS value alone does not provide a definitive assessment of clustering quality and should be interpreted relative to other evaluations and other values of k.

BCubed precision and recall

cluster_assignments <- c(km$cluster)
ground_truth_labels <- c(data$stroke)
datacluster <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(datacluster) {
  n <- nrow(datacluster)
  precision_sum <- 0
  recall_sum <- 0
  
  for (i in 1:n) {
    cluster <- datacluster$cluster[i]
    label <- datacluster$label[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(datacluster$label[datacluster$cluster == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(datacluster$cluster == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(datacluster$label == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }

  # Calculate average precision and recall
  precision <- precision_sum / n
  recall <- recall_sum / n

  return(list(precision = precision, recall = recall))
}

# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(datacluster)

# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall

# Print the results
cat("BCubed Precision:", precision, "\n")
## BCubed Precision: 0.5404063
cat("BCubed Recall:", recall, "\n")
## BCubed Recall: 0.669987

This user-defined function calculates BCubed precision and recall; the higher the values, the better the clustering performance. Our precision (0.5404063) and recall (0.669987) are reasonably high, indicating fair agreement between the clustering results and the reference class label “stroke”.
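As a side note, the per-item loop above can be computed equivalently, and much faster, from a cluster-by-label contingency table, since every item in the same (cluster, label) cell contributes the same precision and recall terms. This is only a sketch of that equivalence (the function name bcubed_fast is ours), not the version used for the reported numbers:

```r
# Vectorized equivalent of calculate_bcubed_metrics():
# precision = (1/n) * sum over cells of n_cl^2 / n_c; recall uses n_l instead
bcubed_fast <- function(cluster, label) {
  tab <- table(cluster, label)   # n_cl: items in cluster c carrying label l
  n   <- sum(tab)
  list(precision = sum(sweep(tab^2, 1, rowSums(tab), "/")) / n,
       recall    = sum(sweep(tab^2, 2, colSums(tab), "/")) / n)
}

# Tiny check: three items, two sharing a cluster, all with the same label
bcubed_fast(cluster = c(1, 1, 2), label = c(0, 0, 0))  # precision 1, recall 5/9
```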

- 3 Clusters:

#######cluster k=3

#calculate k-mean
km <- kmeans(data_cluster, 3, iter.max = 140 , algorithm="Lloyd", nstart=100)

#plot k-mean
fviz_cluster(list(data = data_cluster, cluster = km$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

We repeated the clustering analysis with k set to 3 using the 3-mean clustering method. The resulting figure revealed three clusters of unequal size, with one cluster significantly larger than the other two; the cluster sizes reported below (2043, 1702, and 5969 objects) confirm this imbalance.

The clusters displayed clear separation from each other, indicating distinct groupings within the dataset. This separation is a positive outcome as it suggests that the clustering algorithm successfully identified different patterns or characteristics among the data points.

Average silhouette:

#avg silhouette
library(cluster)
sil <- silhouette(km$cluster, dist(data_cluster))

rownames(sil) <- rownames(data_cluster)
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 2043          0.67
## 2       2 1702          0.49
## 3       3 5969          0.65

The average silhouette width, calculated as 0.63, indicates a reasonably high level of separation and cohesion within the clusters. As mentioned earlier, a silhouette width above 0.5 is generally considered indicative of a reasonable degree of separation between data points within their assigned clusters; the observed average width of 0.63 therefore suggests that the clustering algorithm achieved a satisfactory level of discrimination. Comparing this value with the average silhouette width of the 2-mean solution (about 0.53) gives a sense of the relative improvement.

Total within-cluster sum of square:

# Total sum of squares
km$tot.withinss
## [1] 1232.939

A lower WSS value generally indicates more compact clusters, with data points grouped more tightly around their centroids. Our value here (1232.939) is lower than in the 2-mean solution (2500.985); note, however, that WSS almost always decreases as k increases, so it should not be compared across values of k in isolation.

BCubed precision and recall

cluster_assignments <- c(km$cluster)
ground_truth_labels <- c(data$stroke)
datacluster <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(datacluster) {
  n <- nrow(datacluster)
  precision_sum <- 0
  recall_sum <- 0
  
  for (i in 1:n) {
    cluster <- datacluster$cluster[i]
    label <- datacluster$label[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(datacluster$label[datacluster$cluster == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(datacluster$cluster == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(datacluster$label == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }

  # Calculate average precision and recall
  precision <- precision_sum / n
  recall <- recall_sum / n

  return(list(precision = precision, recall = recall))
}

# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(datacluster)

# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall

# Print the results
cat("BCubed Precision:", precision, "\n")
## BCubed Precision: 0.5693427
cat("BCubed Recall:", recall, "\n")
## BCubed Recall: 0.4817475

The BCubed precision (0.5693427) and recall (0.4817475) provide an evaluation of the clustering method’s agreement with the reference class label “stroke”. While higher values are generally desirable, note that although precision improved slightly over the 2-cluster solution, recall dropped substantially.

- 5 Clusters:

#######cluster k=5
#calculate k-mean
km <- kmeans(data_cluster, 5, iter.max = 140 , algorithm="Lloyd", nstart=100)


#plot k-mean
fviz_cluster(list(data = data_cluster, cluster = km$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

We then utilized a value of k equal to 5, employing the 5-mean clustering method. The resulting figure displayed five clusters of noticeably unequal size; as the silhouette output below shows, the smallest contains only 243 objects. Two of these clusters appeared to be in close proximity to each other, while the remaining clusters exhibited clear separation.

The closeness between two clusters can occur due to several reasons. It is possible that the data points in these clusters share similar characteristics or have overlapping features, leading to a reduced degree of separation. Alternatively, it could be an indication of limitations in the clustering algorithm’s ability to distinguish between these clusters effectively.

Average silhouette:

#avg silhouette
library(cluster)
sil <- silhouette(km$cluster, dist(data_cluster))
rownames(sil) <- rownames(data_cluster)
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 1520          0.61
## 2       2 4449          0.54
## 3       3 1467          0.59
## 4       4 2035          0.66
## 5       5  243          0.69

The weighted average silhouette width here is about 0.59, slightly lower than the 0.63 of the 3-mean solution. This difference suggests that the 3-mean approach achieved a relatively higher level of separation and cohesion within each cluster.

Total within-cluster sum of square:

# Total sum of squares
km$tot.withinss
## [1] 670.6099

The lower WSS value compared to the 3-mean clustering solution (1232.939) suggests that the clusters in the 5-mean solution are more compact, indicating that the data points within each cluster are more tightly grouped around their centroids. This can be seen as a positive outcome, indicating improved clustering results in terms of compactness.

BCubed precision and recall

cluster_assignments <- c(km$cluster)
ground_truth_labels <- c(data$stroke)
datacluster <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(datacluster) {
  n <- nrow(datacluster)
  precision_sum <- 0
  recall_sum <- 0
  
  for (i in 1:n) {
    cluster <- datacluster$cluster[i]
    label <- datacluster$label[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(datacluster$label[datacluster$cluster == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(datacluster$cluster == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(datacluster$label == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }

  # Calculate average precision and recall
  precision <- precision_sum / n
  recall <- recall_sum / n

  return(list(precision = precision, recall = recall))
}

# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(datacluster)

# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall

# Print the results
cat("BCubed Precision:", precision, "\n")
## BCubed Precision: 0.5866183
cat("BCubed Recall:", recall, "\n")
## BCubed Recall: 0.3322907

The obtained BCubed precision (0.5866183) and recall (0.3322907) suggest there is room for improvement in the clustering method’s agreement with the reference labels; the low recall in particular indicates that k = 5 may not capture the underlying number of groups.

- Optimal number of clusters:

The elbow method and average silhouette method are used to find the optimal number of clusters in a k-means clustering algorithm.

Elbow Method:

We apply the elbow method with the within-cluster sum of squares (WSS) to find the optimal number of clusters for the k-means algorithm: the point where the WSS curve bends (the “elbow”) marks the k beyond which adding more clusters yields little improvement.

#elbow with wss
fviz_nbclust(data_cluster, kmeans, method = "wss")+ labs(subtitle = "Elbow method")

Average silhouette method:

The silhouette method helps understand the quality of clustering by measuring how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The optimal number of clusters is often associated with a high average silhouette score.

########### avg  silhouette for all cluster
fviz_nbclust(data_cluster, kmeans, method = "silhouette")

The average silhouette approach determines how well each object lies within its cluster. The highest average silhouette width here occurs at k = 3, indicating a good clustering by this criterion.

Findings:

Classification:

At the outset, our team carefully selected a dataset that represents valuable information about patients. Our goal was to utilize this data to predict the probability of individuals experiencing a stroke. By doing so, we aimed to provide people with the necessary knowledge and preventive measures to improve their overall well-being.

To ensure accurate and reliable results, we applied various preprocessing techniques to refine the dataset. These techniques helped us enhance the efficiency of the data and prepare it for analysis. Additionally, we employed several plotting methods to visually explore the dataset, allowing us to gain a deeper understanding of its characteristics and determine the most appropriate preprocessing steps.

Based on our observations from the plots and utilizing other relevant commands, we took steps to address any issues such as missing or outlier values. We removed these problematic instances from the dataset to prevent them from negatively impacting the accuracy of our predictions. Furthermore, we performed data transformation, which involved normalizing and discretizing certain attributes. This process aimed to ensure that all attributes carried equal weight and simplified data handling during subsequent data mining tasks.

Through these efforts, we strived to create an efficient and reliable predictive model that could effectively assist individuals in taking proactive measures to lead healthier lives.

Following the preprocessing stage, we proceeded to apply various methods, including the Gini index, gain ratio, and information gain, using different partitioning techniques. We carefully evaluated the outcomes of each method to determine the most suitable approach for our specific dataset. The results can be viewed in the tables below.

####GAIN RATIO:

| Metric      | 70% train / 30% test | 80% train / 20% test | 85% train / 15% test |
|-------------|----------------------|----------------------|----------------------|
| Accuracy    | 0.7967               | 0.795                | 0.78                 |
| Precision   | 0.7432               | 0.7323               | 0.7442               |
| Sensitivity | 0.9067               | 0.9300               | 0.8533               |
| Specificity | 0.6867               | 0.6600               | 0.7067               |

####GINI INDEX:

| Metric      | 70% train / 30% test | 80% train / 20% test | 85% train / 15% test |
|-------------|----------------------|----------------------|----------------------|
| Accuracy    | 0.74                 | 0.78                 | 0.74                 |
| Precision   | 0.6875               | 0.7414               | 0.7045               |
| Sensitivity | 0.8800               | 0.8600               | 0.8267               |
| Specificity | 0.6000               | 0.7000               | 0.6533               |

####INFORMATION GAIN:

| Metric      | 70% train / 30% test | 80% train / 20% test | 85% train / 15% test |
|-------------|----------------------|----------------------|----------------------|
| Accuracy    | 0.7433               | 0.76                 | 0.7067               |
| Precision   | 0.6780               | 0.6970               | 0.6505               |
| Sensitivity | 0.9267               | 0.9200               | 0.8933               |
| Specificity | 0.5600               | 0.6000               | 0.5200               |

After thorough analysis, we determined that the gain ratio method with a 70% training / 30% testing split yielded the most accurate and reliable results for our dataset. We therefore selected it as the method for constructing the decision tree.

As discussed, gain ratio is considered superior to information gain in certain scenarios because it addresses the bias introduced by attributes with high cardinality or many distinct values. Information gain measures the reduction in uncertainty achieved by splitting the data on an attribute, but it tends to favor attributes with more distinct values or partitions. The 70/30 partition generally performed best across all three methods; its larger test set also yields a more stable estimate of model performance.

In contrast, the gain ratio normalizes the information gain by considering the intrinsic information of the attribute. It takes into account the number of distinct values an attribute can have, penalizing attributes with high cardinality. By doing so, gain ratio provides a fairer comparison among attributes, considering both their information gain and the partitions they create.
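In C4.5’s standard formulation, the normalization described above is written as follows, where splitting dataset D on attribute A produces v partitions D_1, …, D_v:

```latex
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)},
\qquad
\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}
```

SplitInfo grows with the number of partitions, which is exactly the penalty on high-cardinality attributes described above.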

The decision tree model yielded the following metrics:

  • Accuracy: The accuracy of the model is 79.67%.

  • Precision: The precision of the model is 74.32%.

  • Sensitivity: The sensitivity (or recall) of the model is 90.67%.

  • Specificity: The specificity of the model is 68.67%.
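These four figures are mutually consistent. For instance, with a hypothetical balanced test set of 150 stroke and 150 non-stroke cases (the counts TP = 136, FN = 14, TN = 103, FP = 47 are our illustrative reconstruction, not taken from the report), the reported values are reproduced exactly:

```r
# Hypothetical confusion-matrix counts consistent with the reported metrics
TP <- 136; FN <- 14   # stroke cases classified correctly / incorrectly
TN <- 103; FP <- 47   # non-stroke cases classified correctly / incorrectly

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # (136+103)/300 = 0.7967
precision   <- TP / (TP + FP)                   # 136/183       = 0.7432
sensitivity <- TP / (TP + FN)                   # 136/150       = 0.9067
specificity <- TN / (TN + FP)                   # 103/150       = 0.6867
round(c(accuracy, precision, sensitivity, specificity), 4)
```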

From the analysis of the decision tree, several findings can be observed:

  • Number of Leaf Nodes: The decision tree consists of 9 leaf nodes, indicating that 9 rules or distinct conditions have been extracted from the tree.

  • Important Attributes: The decision nodes in the tree utilize the attributes of age, ever married, and hypertension to make decisions and classify instances.

  • Age and Stroke: The analysis reveals that older individuals are more likely to have a stroke. This suggests that age is an important factor in determining the risk of stroke.

  • Hypertension and Stroke: The tree also indicates that individuals with higher levels of hypertension are more prone to suffering from a stroke. This implies that hypertension is a significant contributing factor to stroke risk.

These insights provide valuable information about the relationships between specific attributes and the occurrence of strokes. They can be utilized to raise awareness, inform preventive measures, and guide healthcare interventions for individuals at higher risk of stroke.

Clustering:

For clustering, we used the k-means algorithm with three different values of K. To find the optimal number of clusters, we calculated the average silhouette width and other metrics for each K, and obtained the following results:

| K-means                             | K=2       | K=3       | K=5       |
|-------------------------------------|-----------|-----------|-----------|
| Average silhouette width            | 0.53      | 0.63      | 0.59      |
| Total within-cluster sum of squares | 2500.985  | 1232.939  | 670.6099  |
| BCubed precision                    | 0.5404063 | 0.5693427 | 0.5866183 |
| BCubed recall                       | 0.669987  | 0.4817475 | 0.3322907 |

• Number of clusters (K) = 2: average silhouette width = 0.53

• Number of clusters (K) = 3: average silhouette width = 0.63

• Number of clusters (K) = 5: average silhouette width = 0.59

Based on the analysis of the clustering results, it is evident that the choice of the number of clusters significantly impacts the quality of the clustering solution. The 2-mean solution aligns best with the nature of our dataset, whose class labels indicate the presence or absence of a stroke (0 for no stroke, 1 for stroke). Therefore, the optimal number of clusters for this dataset is 2, despite the 3-mean solution appearing better on some internal metrics.

The class labels in our dataset indicate that a binary classification problem is at hand, separating instances into two distinct categories: stroke and no stroke. Since the class labels suggest the existence of two clusters, it is more appropriate to choose a 2-mean clustering solution to accurately capture these distinct patterns.

However, when evaluating the clustering results using metrics such as silhouette analysis, within-cluster sum of squares (WSS), and BCubed precision and recall, the 3-mean clustering solution may appear to have better performance. This discrepancy can be attributed to factors such as the inherent complexity of the data or the influence of other variables not directly captured by the class labels.

While the 3-mean clustering solution visually appears superior, it is crucial to prioritize the alignment with the class labels and the nature of the problem at hand. Therefore, based on the stroke classification problem and the binary class labels, the 2-mean clustering solution is deemed better in this scenario.

It is worth noting that the clustering algorithm’s tendency to identify three clusters might be influenced by factors in the dataset that are not explicitly captured by the class labels; these could introduce additional complexity and patterns that the algorithm attempts to model. However, for the specific problem of stroke prediction, although some internal metrics favor the 3-mean solution, the binary nature of the class labels suggests that a 2-mean clustering solution is more appropriate and aligned with the desired outcome.

Classification or Clustering?

In conclusion, classification and clustering both play important roles in machine learning. For our problem and dataset, classification is the better choice, since our goal is to predict whether an individual will experience a stroke. Clustering can reveal more about our data and how patients group together on common features, and comparing the clusters to the ground truth (suffered a stroke or not) can highlight attributes common to each group, but it cannot by itself predict whether an individual will suffer a stroke.


Clustering sources:

Packages:

  • ggplot2: is a plotting package that provides helpful commands to create complex plots from data in a data frame.

  • factoextra: flexible, easy-to-use functions to extract and visualize the results of multivariate data analyses in a human-readable format.

  • DPBBM: Beta-binomial Mixture Model is used to infer the pattern from count data. It can be used for clustering of RNA methylation sequencing data.

Libraries:

  • ggplot2: the grammar of graphics.

  • factoextra: to visualize the cluster.

  • cluster: to use silhouette method.

Methods:

  • as.numeric(): converts values to numeric type before clustering.

  • as.integer(): returns values as integer objects.

  • rownames(): sets the row names of a data frame.

  • kmeans(): runs k-means clustering to find N clusters.

  • fviz_cluster(): visualizes the clusters.

  • silhouette(): computes the silhouette width of each observation and the average per cluster.

  • fviz_silhouette(): visualizes the cluster silhouettes and the average silhouette width.

  • fviz_nbclust(): finds and visualizes the optimal number of clusters.

  • calculate_bcubed_metrics(): user-defined function that computes average BCubed precision and recall.